Abstract: This work focuses on clustering a site into groups of documents that are predictive of future user accesses. Two
approaches have been developed and tested. The first approach uses semantic information inherent in the documents to facilitate
the clustering process. User access history is then used to reorganize the clusters iteratively so as to better indicate access
patterns. This method was not found to be an effective solution to the problem. Hence, a second method based on hierarchical
clustering of trail information was developed. This method is shown to be far more effective than the first.




1 Introduction 



With the rapid proliferation of websites on the Internet over the past few years, it has become imperative for websites to 
enhance the quality of service that they provide in order to attract and sustain user traffic. The average user is interested 
only in a limited subset of the available content at a website. The emphasis therefore should be on developing tools
that help the user select that subset (automatic customization of hyperlink presentation order, for example). Such a
strategy requires predicting a user's actions based on past user activity at the website.

One way to facilitate prediction would be to develop a model for user access patterns. The assumption is that 
patterns exist in aggregate user access histories that allow the behavior of one user to be predicted based on the behavior 
of previous users. The first step towards modeling user access patterns is modeling the site. Site modeling involves 
organizing and grouping the pages (or documents) present in the site. A variety of criteria can be used, at least in
theory, to group the documents available on the web server. These criteria can be placed under two broad categories:
(a) organization based on the content of documents ([Green, 1998, Weiss et al., 1996, Fowler et al., 1996]) and (b)
organization based on the access history of documents ([Joshi and Krishnapuram, 1998, Perkowitz and Etzioni, 1997, 
Mobasher et al., 1996]). In this paper we describe two approaches that we have developed to induce clustering of 
documents. 

2 The World Cup 1998 Server Log Data

The test data for this research consists of 14 weeks of server logs from the 1998 World Cup Soccer site
(http://www.france98.org). The data was collected during the period of the 1998 World Cup games. The server logs (the user
access history for 14 weeks) were provided by Hewlett-Packard Labs ([Arlitt and Jin, 1999]). The following table lists
some basic statistics about this data set.



Number of weeks                        14
Total page requests                    1,350,004,229
Number of distinct IP addresses        2,769,788
HTML requests                          38.59%
Image requests                         35.03%
Other requests (audio, video, etc.)    26.38%



The site is structured as a bilingual English/French site. We have considered the English content only. The trail 
data is very large, so we have focused on two representative weeks for our initial tests - weeks 3 and 7, a medium and 
a large traffic week. 

3 Problem Definition



Let $D = \{d_1, d_2, \ldots, d_N\}$ refer to the set of N HTML documents present in the server. The user access history can
then be expressed as $L = \{\langle m_1, t_1, d_1 \rangle, \langle m_2, t_2, d_2 \rangle, \ldots\}$, where $m_j$ indicates the
source of the request (an IP address or a machine identifier), $t_j$ is the server access time, and $d_j$ is the requested
document. Processing this information results in a set $S = \{S_1, S_2, \ldots, S_l\}$ of user sessions. Each session
$S_j = \{d_{j1}, d_{j2}, \ldots, d_{jk}\}$ indicates the order in which documents were requested by a single machine.

The goal in this work is to identify K document clusters such that user sessions span the minimum number of
clusters. Thus we have to identify K clusters, $C = \{c_1, c_2, \ldots, c_K\}$, with each $c_i \subset D$, such that the number
of inter-cluster transitions (ICT), normalized with respect to the total number of transitions, is minimized. A transition is
simply two sequential accesses in a user session. The inter-cluster transition criterion is defined in Eq. 1.

$$ICT(K) = \frac{\sum_{j=1}^{l} \left|\{\langle i \rangle \in T_j : d_{ji} \in c_p,\ d_{j,i+1} \in c_q,\ c_p \neq c_q\}\right|}{\sum_{j=1}^{l} \left(|S_j| - 1\right)} \qquad (1)$$

Here $T_j$ represents all transitions in session $S_j$. The choice of K does, of course, have a significant influence
on this formulation and has a major effect on the performance of any system that utilizes clustering for predictive
purposes. In our current formulation, we have chosen K empirically. Even for a fixed value of K, the problem of
finding the optimal value of ICT(K) is practically intractable [Jain and Dubes, 1988]. Thus, we apply heuristics to
perform the above clustering.
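
The criterion in Eq. 1 can be evaluated directly once sessions and a cluster assignment are available. The following is a minimal Python sketch, not the authors' implementation; the function name, the list-of-lists session format and the `cluster_of` dictionary are assumptions made for illustration.

```python
def inter_cluster_transitions(sessions, cluster_of):
    """Return (penalty, total, ICT) for a list of sessions.

    sessions   -- list of lists of document ids, in request order
    cluster_of -- dict mapping document id -> cluster id (hypothetical format)
    """
    penalty = 0  # transitions whose two endpoints lie in different clusters
    total = 0    # all transitions, i.e. the sum of (|S_j| - 1) over sessions
    for session in sessions:
        for src, dst in zip(session, session[1:]):
            total += 1
            if cluster_of[src] != cluster_of[dst]:
                penalty += 1
    ict = penalty / total if total else 0.0
    return penalty, total, ict


# Toy data: three sessions over four documents in two clusters.
sessions = [["a", "b", "c"], ["a", "c"], ["d"]]
cluster_of = {"a": 0, "b": 0, "c": 1, "d": 1}
penalty, total, ict = inter_cluster_transitions(sessions, cluster_of)
print(penalty, total, round(ict, 3))
```

A session with a single request contributes no transitions, which matches the $|S_j| - 1$ term in the denominator.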

4 Method 1: Clustering Using Semantic and Trail Information 

In addition to the user trail information, document content is also indicative of relationships among site pages.
Consequently, we chose to initiate a study of the relationship between semantic clustering and trail clustering. Semantic
clustering can be used to “bootstrap” access pattern clustering by providing an initial grouping of pages that can then be
iteratively reclustered with increasing influence from the trail information. Also, semantic clustering provides a mechanism
for insertion of new documents into document sets prior to the availability of trail information (document routing,
[Hull et al., 1996]).

The documents at the server site are first clustered (K clusters) according to their semantic content (words in
the documents). This semantic clustering consists of the following two steps: (1) Word Clustering: This involves
extracting unique words from all documents and using them as features. Since the number of words in a document,
and hence in the entire set of documents at the server, can become huge, we reduce the number of features. A number
of methods have been described in the literature for reducing the dimensionality of the feature space (such as singular
value decomposition (SVD) [Deerwester et al., 1990] and feature clustering [Wulfekuhler and Punch, 1997]). We use
the feature clustering method, since it is fast and has been shown to be an efficient dimensionality reduction method.
(2) Document Clustering: The N documents are then partitioned into K clusters using the new, reduced feature set.

4.1 Word Clustering

Let $W = \{w_1, w_2, \ldots, w_M\}$ be the set of unique words (after stemming the words, removing stop words and combining
words that always occur together) extracted from the N documents. The pattern matrix (with the words acting as
features) for the N documents can be represented as $F_D = [f(d_1)\ f(d_2)\ \ldots\ f(d_N)]^T$, where
$f(d_i) = [b^i_1\ b^i_2\ \ldots\ b^i_M]^T$ and $b^i_j = 1$ if the word $w_j$ occurs in $d_i$ and $b^i_j = 0$ otherwise.
$F_D$ is an N x M matrix. Since the value of M can be very large, we first cluster the words into a smaller subset prior to
clustering the documents. In order to do so, we cluster the inverted pattern matrix $F_D^T$, where each row now represents
a word (and the N columns represent the documents), into M' clusters (M' << M). In our current implementation, M' is chosen
empirically (500 in early experiments). We employ the K-Means clustering algorithm, with a normalized cosine measure as the
dissimilarity measure between two feature vectors. The feature clustering yields a new set of words,
$W' = \{w'_1, w'_2, \ldots, w'_{M'}\}$, where $|W'| = M'$. The new pattern matrix (for the N documents) is defined as
$F'_D = [f'(d_1)\ f'(d_2)\ \ldots\ f'(d_N)]^T$, where $F'_D(i,j) = 1$ if any of the words assigned to cluster $w'_j$ occur
in $d_i$.
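
To make the two pattern matrices concrete, here is a small Python sketch. It assumes that documents are already reduced to sets of stemmed words and that a word-to-cluster assignment (`word_cluster`, a hypothetical stand-in for the K-Means output) is given; none of these names come from the original implementation.

```python
def pattern_matrix(docs, vocabulary):
    """Binary N x M matrix: entry [i][j] is 1 if word j occurs in document i."""
    return [[1 if w in doc else 0 for w in vocabulary] for doc in docs]


def reduced_pattern_matrix(F, word_cluster, n_clusters):
    """Collapse the M word columns into M' word-cluster columns.

    Entry [i][c] is 1 if any word assigned to cluster c occurs in document i.
    """
    reduced = []
    for row in F:
        new_row = [0] * n_clusters
        for j, bit in enumerate(row):
            if bit:
                new_row[word_cluster[j]] = 1
        reduced.append(new_row)
    return reduced


# Toy data: 3 documents, 4 words, 2 word clusters (all values are made up).
docs = [{"goal", "match"}, {"ticket", "stadium"}, {"goal", "stadium"}]
vocabulary = ["goal", "match", "stadium", "ticket"]
word_cluster = [0, 0, 1, 1]  # word index -> word-cluster index
F = pattern_matrix(docs, vocabulary)
F_reduced = reduced_pattern_matrix(F, word_cluster, 2)
print(F)
print(F_reduced)
```

The word clustering itself operates on the rows of the transposed matrix, so each word is described by the set of documents it occurs in.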

4.2 Document Clustering

Having clustered the words, we next cluster the N documents into K clusters based on the reduced-dimensionality
pattern matrix, $F'_D$. We again apply the K-Means clustering algorithm, with the normalized cosine measure as the
dissimilarity measure between two feature vectors.
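
The clustering step can be sketched as a plain K-Means loop with a cosine dissimilarity. This is an illustrative Python sketch under stated assumptions: the initial centroids are passed in explicitly (the paper does not say how they were chosen), centroids are unnormalized means, and the exact normalization of the cosine measure used by the authors is not known.

```python
import math


def cosine_dissimilarity(u, v):
    """1 - cos(u, v); all-zero vectors are treated as maximally dissimilar."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    if nu == 0 or nv == 0:
        return 1.0
    return 1.0 - dot / (nu * nv)


def kmeans_cosine(rows, initial_centroids, max_iter=100):
    """Assign each row to its nearest centroid until the labels stop changing."""
    centroids = [list(c) for c in initial_centroids]
    labels = None
    for _ in range(max_iter):
        new_labels = [
            min(range(len(centroids)),
                key=lambda c: cosine_dissimilarity(r, centroids[c]))
            for r in rows
        ]
        if new_labels == labels:
            break
        labels = new_labels
        for c in range(len(centroids)):
            members = [r for r, lab in zip(rows, labels) if lab == c]
            if members:  # keep the old centroid if a cluster becomes empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


# Toy binary feature vectors; the first and third rows seed the two clusters.
rows = [[1, 1, 0, 0], [1, 0, 0, 0], [0, 0, 1, 1], [0, 0, 0, 1]]
labels = kmeans_cosine(rows, [rows[0], rows[2]])
print(labels)
```

Because the cosine measure ignores vector length, using the unnormalized mean as the centroid does not change which centroid is nearest.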



4.3 Using User Access Information 



With time, server logs with user requests become available to the server. These are then used to refine the semantic 
clustering performed above. 

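Define a new pattern matrix, $H_D = [\omega_F F'_D \;\; \omega_G G_D]_{N \times (M'+K)}$, where

$$G_D(i,j) = \frac{\sum_{k=1}^{|S|} \sum_{d_m \in c_j} l_k(d_i, d_m) \;+\; \mathbb{1}[d_i \in c_j] \sum_{k=1}^{|S|} \sum_{m=1}^{N} l_k(d_m, d_i)}{\sum_{k=1}^{|S|} \sum_{m=1}^{N} \left[\, l_k(d_i, d_m) + l_k(d_m, d_i) \,\right]}, \qquad 1 \le i \le N,\ 1 \le j \le K,$$

and

$$\omega_F(i) = \sqrt{1 - \alpha^2(t)}, \qquad 1 \le i \le N,$$
$$\omega_G(i) = \alpha(t), \qquad 1 \le i \le N.$$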



$l_k(d_i, d_m)$ is the number of transitions from $d_i$ to $d_m$ in user session k. $\alpha(t)$ is a parameter which
monotonically increases with time, t. As more and more server log data become available, $\alpha(t)$ is increased to give a
higher weight to user access information and a smaller weight to the semantic content in the HTML documents. Our current
implementation does not change the value of $\alpha(t)$ with time, since the server and document set are fixed (we are
post-processing historical data). We have empirically chosen $\alpha(t) = 0.7$. $G_D(i,j)$ is the normalized count of the
number of transitions from document $d_i$ to documents in cluster $c_j$, plus all transitions into $d_i$ if $d_i \in c_j$.
The new pattern matrix is then subjected to the K-Means clustering algorithm in order to derive a fresh set of clusters.
This process is repeated in an iterative fashion.
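
The construction of the combined pattern matrix can be sketched as follows. This Python sketch follows the prose description of the trail features; the normalization (dividing by all transitions out of and into a document) is an assumption, and the function names and data layout are hypothetical.

```python
import math


def trail_features(sessions, docs, cluster_of, n_clusters):
    """G[i][j]: transitions from doc i into cluster j, plus all transitions
    into doc i when doc i belongs to cluster j, divided by all transitions
    touching doc i (assumed normalization)."""
    index = {d: i for i, d in enumerate(docs)}
    out_to_cluster = [[0] * n_clusters for _ in docs]
    incoming = [0] * len(docs)
    outgoing = [0] * len(docs)
    for session in sessions:
        for src, dst in zip(session, session[1:]):
            out_to_cluster[index[src]][cluster_of[dst]] += 1
            outgoing[index[src]] += 1
            incoming[index[dst]] += 1
    G = []
    for i, d in enumerate(docs):
        total = outgoing[i] + incoming[i]
        row = []
        for j in range(n_clusters):
            count = out_to_cluster[i][j]
            if cluster_of[d] == j:
                count += incoming[i]
            row.append(count / total if total else 0.0)
        G.append(row)
    return G


def combined_pattern_matrix(F_reduced, G, alpha):
    """Concatenate semantic and trail features with weights sqrt(1 - alpha^2) and alpha."""
    w_f = math.sqrt(1.0 - alpha ** 2)
    return [[w_f * x for x in f_row] + [alpha * g for g in g_row]
            for f_row, g_row in zip(F_reduced, G)]


# Toy data (made up): 3 documents, 2 clusters, 2 semantic features.
docs = ["a", "b", "c"]
cluster_of = {"a": 0, "b": 0, "c": 1}
sessions = [["a", "b", "c"], ["a", "c"]]
F_reduced = [[1, 0], [0, 1], [1, 1]]
G = trail_features(sessions, docs, cluster_of, 2)
H = combined_pattern_matrix(F_reduced, G, alpha=0.7)
print(G)
print(H)
```

With $\alpha = 0.7$ the semantic block is scaled by $\sqrt{0.51} \approx 0.714$, so the two blocks carry nearly equal weight.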

It is expected that the incremental inclusion of user session information will improve the predictive performance. 
As discussed in the next section, this does turn out to be true, though not to the extent expected. 



4.4 Experimental Results 



The first step in the algorithm is the clustering of words into reduced feature vectors. Here are some statistics relevant 
to the word clustering process: 



Number of words after stemming and removing stop words                 19,227
Number of words after removing words occurring in single documents     13,105
Number of words after combining words always occurring together        11,002



Fig. 1(a) illustrates the result of the word clustering step in the algorithm. The vast majority of clusters have word
counts of around 20 words, as expected given the partitioning of 11,002 words into 500 clusters (about 22 words per
cluster on average).

While some clusters were compact (with a low intra-cluster error), other clusters had large intra-cluster errors.
Members that lie at distances greater than a chosen threshold could be treated as outliers and reclustered separately.
We plan to perform these experiments as future work.

The next step of the algorithm performs document clustering using the reduced-size word feature vectors. This is
semantic clustering only. The number of unique HTML English documents considered was 5,841 and the value of K
was set to 500. The results of this process are summarized below:



                         Week 3        Week 7
Number of transitions    852,125       14,892,627
Penalty                  761,069       14,227,220
ICT(500)_sm              0.89          0.95



The term penalty refers to the number of out-of-cluster transitions. These results indicate that semantic clusters do
not appear to be effective predictors of user trails. In week 3, for example, only 11% of all transitions are to pages that
are semantically related. Most of the clusters, as seen in the histogram in Fig. 1(b), are reasonably sized.
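
As a quick check of the week 3 figures, the reported ratio follows directly from the penalty and the transition count given in the table, and its complement gives the share of within-cluster transitions:

```latex
ICT(500)_{sm} = \frac{761{,}069}{852{,}125} \approx 0.893,
\qquad
1 - 0.893 \approx 0.11
```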

Figure 1: Histogram of cluster sizes

The final step in the process involves the application of the user session information. This is an iterative process, 
in that the clustering is based on the participation of documents in initial clusters. If the algorithm performs properly, 
we expect to see decreasing ICT values for each iteration. The algorithm was applied to the test data. The results are 
tabulated below: 



              Week 3                        Week 7
Iteration     Penalty    ICT(500)_sl        Penalty       ICT(500)_sl
1             613,490    0.720              14,119,968    0.948
2             598,703    0.700              13,873,159    0.931
3             590,950    0.693              13,774,633    0.925
4             588,814    0.691              13,861,116    0.931



As seen above, repeated iterations do not reduce the penalty considerably (in week 7 it even rises slightly at
iteration 4), and the predictive power of the method remains limited.

5 Method 2: Hierarchical Trail Clustering 

Method 1 proved to be computationally complex due to the iterative application of the K-Means algorithm, and the
method produced rather disappointing results. Thus, we used what we had learned about the structure of the data
to develop another method that does not use the semantic information contained in documents for site clustering.
Instead, it uses only the trail information. This method applies hierarchical clustering to a proximity matrix
generated from the user trail information. In order to perform hierarchical clustering, a similarity (or dissimilarity)
measure between documents must be defined. A crucial part of the hierarchical method involves updating the similarity
measures after combining two documents or two clusters. For this application we chose the single-link technique
which, unlike other techniques available, helps combine nodes that occur in a trail. The clustering routine generates a
dendrogram, which indicates the clusters and their components at various thresholds ([Jain and Dubes, 1988]).
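
Cutting a single-link dendrogram at a similarity threshold yields the connected components of the graph whose edges join documents with similarity at or above that threshold. The Python sketch below uses this equivalence with a union-find structure; it is an illustration with made-up similarity values, not the clustering routine used in the experiments.

```python
def single_link_clusters(similarity, threshold):
    """Clusters obtained by cutting a single-link dendrogram at `threshold`.

    similarity -- symmetric N x N matrix (list of lists)
    Returns a sorted list of clusters, each a list of document indices.
    """
    n = len(similarity)
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for i in range(n):
        for j in range(i + 1, n):
            if similarity[i][j] >= threshold:
                parent[find(i)] = find(j)  # merge the two components

    clusters = {}
    for i in range(n):
        clusters.setdefault(find(i), []).append(i)
    return sorted(clusters.values())


similarity = [
    [1.0, 0.8, 0.1, 0.0],
    [0.8, 1.0, 0.3, 0.0],
    [0.1, 0.3, 1.0, 0.6],
    [0.0, 0.0, 0.6, 1.0],
]
loose = single_link_clusters(similarity, 0.25)
tight = single_link_clusters(similarity, 0.5)
print(loose, tight)
```

In the toy matrix, documents 0 and 3 have similarity 0 yet end up in one cluster at the lower threshold, because a chain of pairwise links connects them; this chaining is what lets single-link follow a trail.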

5.1 Similarity Measure 

In order to compute a similarity measure, the transition matrix was used. Each entry in the transition matrix indicates
the number of transitions between pairs of documents, i.e., entry $t_{ij}$ in the N x N transition matrix $T = \{t_{ij}\}$
denotes the total number of transitions from document i to document j as observed from the access log. Clearly, the
matrix T need not be symmetrical. However, it is an indicator of the pair-wise similarity between documents. The
transition matrix is used to generate the similarity matrix $S = \{s_{ij}\}$, which is an N x N symmetrical matrix with
entry $s_{ij}$ indicating the similarity value between documents i and j. We present below the technique that was used to
transform T into S.
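
The transition matrix can be accumulated in a single pass over the sessions. The Python sketch below assumes sessions are lists of document identifiers; the function name and the toy page names are hypothetical.

```python
def transition_matrix(sessions, docs):
    """T[i][j] = number of observed transitions from docs[i] to docs[j]."""
    index = {d: i for i, d in enumerate(docs)}
    T = [[0] * len(docs) for _ in docs]
    for session in sessions:
        for src, dst in zip(session, session[1:]):
            T[index[src]][index[dst]] += 1
    return T


docs = ["index", "teams", "scores"]
sessions = [["index", "teams", "index", "scores"], ["index", "scores"]]
T = transition_matrix(sessions, docs)
print(T)
```

In this toy example T[0][2] is 2 while T[2][0] is 0, which illustrates why T is generally not symmetric.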

Figure 2: Performance of the Hierarchical Clustering Technique



5.1.1 Preprocessing I 



In order to offset the noise introduced by long trails (an effect of proxy servers and/or web spiders), trails of size
greater than 40 were eliminated. After preprocessing, only 866 of the 5,841 documents had at least a single non-zero entry
in the week 3 transition matrix. Thus the first set of experiments operated on an 866 x 866 transition matrix representing
759,640 transitions between the documents.
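
The filtering step can be sketched in a few lines of Python. The sketch assumes that trail size is measured as the number of requests in a session; the helper names are hypothetical.

```python
MAX_TRAIL_LENGTH = 40  # threshold stated in the text


def drop_long_trails(sessions, max_length=MAX_TRAIL_LENGTH):
    """Discard sessions longer than max_length requests (likely proxies or spiders)."""
    return [s for s in sessions if len(s) <= max_length]


def active_documents(sessions):
    """Documents that take part in at least one transition, i.e. those with a
    non-zero row or column in the transition matrix."""
    active = set()
    for s in sessions:
        if len(s) > 1:
            active.update(s)
    return active


# Toy data: the 41-request session is dropped, the single-request one has no transitions.
sessions = [["a", "b"], ["c"], ["x"] * 41, ["d", "e", "f"]]
kept = drop_long_trails(sessions)
active = active_documents(kept)
print(len(kept), sorted(active))
```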

To compute the similarity matrix S, the matrices T and T' were used, T' being the transpose of T. While each
entry $t_{ij}$ in T denotes the number of transitions from document i to document j, each entry $t'_{ij}$ in T' denotes the
number of transitions to document i from document j (that is, $t'_{ij} = t_{ji}$). Clearly, these values are identical
prior to normalization.



1. Step 1: As a first step, all rows of T and T' are normalized. [Normalization formulas illegible in the source.]

2. Step 2: The similarity matrix is computed as follows:

$$s_{i,j} = \max\{\min\{\ldots\}, \min\{\ldots\}\}$$ [arguments illegible in the source]



5.1.2 Preprocessing II 

One would intuitively expect the method described in the previous section to result in reasonably sized clusters.
However, results indicate (see Fig. 2(b)) that the single-link technique operating on the similarity matrix constructed
by this method has a tendency to form a single large cluster. To find the reason for this anomaly, the
transition matrix was examined in greater detail. It was observed that only 287 of the 866 documents were involved
in at least 100 transitions. Thus we used the following heuristic to prune the transition matrix and alter its
entries: only those documents that were involved in at least 100 different transitions (either into them or out of them)
were considered. The transition matrix eventually considered 287 documents and had a total of 749,270 transitions.
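
The pruning heuristic can be sketched as follows. This Python sketch counts a document's involvement as its row sum plus its column sum and applies the filter once; whether the original procedure counted self-transitions or repeated the filter is not stated, so both choices are assumptions.

```python
def prune_transition_matrix(T, min_transitions=100):
    """Keep only documents involved in at least min_transitions transitions.

    Returns (keep, pruned): the retained document indices and the
    transition matrix restricted to those documents.
    """
    n = len(T)
    involvement = [sum(T[i]) + sum(T[k][i] for k in range(n)) for i in range(n)]
    keep = [i for i in range(n) if involvement[i] >= min_transitions]
    pruned = [[T[i][j] for j in keep] for i in keep]
    return keep, pruned


# Toy matrix (made up): document 2 is involved in only 6 transitions.
T = [
    [0, 120, 3],
    [90, 0, 1],
    [2, 0, 0],
]
keep, pruned = prune_transition_matrix(T)
print(keep, pruned)
```

Restricting the matrix to the retained documents also discards transitions between a retained and a removed document, which is consistent with the total dropping from 759,640 to 749,270 in the text.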

This method gave the best clustering results. Fig. 3(a) shows the fraction of out-of-cluster transitions as the number
of clusters is varied by cutting the dendrogram at various points. Fig. 3(b) shows the number of documents in each
cluster when the fraction of out-of-cluster transitions is approximately 0.3. An ICT value of 0.3 indicates that 70% of
all transitions were predicted by the clustering. In the figure it can be seen that 70% prediction is accomplished with
fewer than 120 clusters, with a maximum cluster size of 32. The average cluster size is much smaller. There is a tendency
for the clustering to emit a larger number of small clusters. We are examining ways to recluster this emitted set in
order to form more uniform groupings of the site content.



6 Conclusion and Future Work 

The result summarized above needs to be verified on the entire 14 weeks of the data set. We are examining this in 
several ways. Currently we are using the entire data set for both training and testing. We plan to partition that data set 
into training and testing to verify that this method is not resulting in over-fitting to the data. 

Figure 3: Performance of the Hierarchical Clustering Technique after considering only those documents involved in
at least 100 transitions. (a) Penalty at various points in the dendrogram. (b) Cluster sizes when ICT = 0.3.

The clustering method presented in this paper necessarily produces non-overlapping clusters. While this is
considered to be effective in many cases, there may be cases where common nodes exist in trails. This is particularly true
if central index nodes are expected to be elements of the user trails. We plan to incorporate a technique that would
facilitate generation of overlapping clusters.

The apparent ineffectiveness of semantic methods does leave the document routing problem as an open issue in 
this work. If new documents cannot be selected by semantic characteristics, it is not certain how they can be routed to 
initial clusters. As presented, the techniques in this report will ignore new documents until sufficient user history has 
been developed. 